
I'm using NLTK to analyze a few classic texts and I'm running into trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Dick:

import nltk
sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')

'''
(Chapter 16)
A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but
that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"
'''
sample = 'A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'

print("\n-----\n".join(sent_tokenize.tokenize(sample)))
'''
OUTPUT
"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs.
-----
Hussey?
-----
" says I, "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs.
-----
Hussey?
-----
"
'''

I don't expect perfection here, considering that Melville's syntax is a bit dated, but NLTK ought to be able to handle terminal double quotes and titles like "Mrs." Since the tokenizer is the result of an unsupervised training algorithm, however, I can't figure out how to tinker with it.

Anyone have recommendations for a better sentence tokenizer? I'd prefer a simple heuristic that I can hack rather than having to train my own parser.


4 Answers


You need to supply a list of abbreviations to the tokenizer, like so:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
sentence_splitter = PunktSentenceTokenizer(punkt_param)
text = "is THAT what you mean, Mrs. Hussey?"
sentences = sentence_splitter.tokenize(text)

sentences is now:

['is THAT what you mean, Mrs. Hussey?']

Update: This does not work if the last word of the sentence has an apostrophe or a quotation mark attached to it (like Hussey?'). So a quick-and-dirty way around this is to put spaces in front of apostrophes and quotes that follow sentence-end symbols (.!?):

text = text.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')
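Putting both pieces together (the abbreviation list plus the quote-spacing workaround), a minimal sketch against the question's own sample might look like this; the variable names are mine, not from the answer:

```python
# Sketch: abbreviation list + space-before-quote workaround combined.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
splitter = PunktSentenceTokenizer(punkt_param)

sample = ('"A clam for supper? a cold clam; is THAT what you mean, '
          'Mrs. Hussey?" says I, "but that\'s a rather cold and clammy '
          'reception in the winter time, ain\'t it, Mrs. Hussey?"')

# Insert a space between sentence-final punctuation and a closing quote
# so Punkt can see the sentence boundary.
spaced = sample.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')

for sent in splitter.tokenize(spaced):
    print(sent)
```

With 'mrs' in the abbreviation set, "Mrs. Hussey" should no longer be split across sentences.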
    
Ah, good to know. Strangely, this does not work if I run the complete sentence in my question through your solution. Any idea why? – Chris Wilson Dec 31 '12 at 16:21
    
Just added some more info into the answer. – vpekar Jan 1 '13 at 10:05
    
I generally avoid 'thanks' comments, but here one really is in order: thanks! – Private Apr 13 at 9:52
    
How do you handle the special case where the sentence has an apostrophe but you want to get the offsets, i.e. using the span_tokenize method? The suggested workaround changes the original offsets. – CentAu May 19 at 15:08
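On the offsets question: PunktSentenceTokenizer also exposes span_tokenize, which returns (start, end) pairs such that text[start:end] reproduces each sentence. A sketch (the example string is made up; whether you can skip the quote-spacing workaround depends on your text):

```python
# Sketch: getting sentence offsets with span_tokenize instead of tokenize.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['mrs'])
splitter = PunktSentenceTokenizer(punkt_param)

text = 'is THAT what you mean, Mrs. Hussey? I think so.'

# Each (start, end) pair indexes directly into the original string.
for start, end in splitter.span_tokenize(text):
    print(start, end, repr(text[start:end]))
```

If you do apply the space-inserting workaround first, the spans index into the modified string, so you would still need to subtract the number of inserted spaces before each span to recover offsets into the original text.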

You can tell the PunktSentenceTokenizer.tokenize method to include "terminal" double quotes with the rest of the sentence by setting the realign_boundaries parameter to True. See the code below for an example.

I do not know a clean way to prevent text like Mrs. Hussey from being split into two sentences. However, here is a hack which

  • mangles all occurrences of Mrs. Hussey to Mrs._Hussey,
  • then splits the text into sentences with sent_tokenize.tokenize,
  • then for each sentence, unmangles Mrs._Hussey back to Mrs. Hussey.

I wish I knew a better way, but this might work in a pinch.


import nltk
import re
import functools

mangle = functools.partial(re.sub, r'([MD]rs?[.]) ([A-Z])', r'\1_\2')
unmangle = functools.partial(re.sub, r'([MD]rs?[.])_([A-Z])', r'\1 \2')

sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')

sample = '''"A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'''    

sample = mangle(sample)
sentences = [unmangle(sent) for sent in sent_tokenize.tokenize(
    sample, realign_boundaries=True)]

print("\n-----\n".join(sentences))

yields

"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs. Hussey?"
-----
says I, "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"
    
Just what I needed -- thank you! – Chris Wilson Dec 31 '12 at 13:06
    
Update: Consolidated part of this answer with the one above – Chris Wilson Jan 1 '13 at 18:05

You can modify NLTK's pre-trained English sentence tokenizer to recognize more abbreviations by adding them to the set _params.abbrev_types. For example:

import nltk

extra_abbreviations = ['dr', 'vs', 'mr', 'mrs', 'prof', 'inc', 'i.e']
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)

Note that the abbreviations must be specified without the final period, but do include any internal periods, as in 'i.e' above. For details about the other tokenizer parameters, refer to the relevant documentation.
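A quick way to check that rule without loading the pre-trained model is to build a tokenizer from scratch with just the abbreviations you care about (a sketch; the example sentence is made up):

```python
# Sketch: verifying the "no final period, keep internal periods" rule.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
# 'i.e' keeps its internal period but drops the final one.
punkt_param.abbrev_types = set(['i.e', 'e.g'])
splitter = PunktSentenceTokenizer(punkt_param)

text = 'The fix is simple, i.e. add the abbreviation. Then it works.'
for sent in splitter.tokenize(text):
    print(sent)
```

The period after "i.e." no longer triggers a sentence break, while the one after "abbreviation." still does.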


So I had a similar issue and tried out vpekar's solution above.

Perhaps mine is some sort of edge case, but I observed the same splitting behavior even after applying the replacements. However, when I instead moved the closing quotation mark before the punctuation, I got the output I was looking for. Presumably lack of adherence to MLA is less important than retaining the original quote as a single sentence.

To be clear:

text = text.replace('?"', '"?').replace('!"', '"!').replace('."', '".')

If MLA-style punctuation matters, though, you can always go back and reverse these changes wherever it counts.
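That reversal can be done per sentence with the mirror-image replacements (a sketch; note it would also rewrite any text that legitimately contained a quote before its punctuation):

```python
def restore_quotes(sentence):
    # Undo the punctuation/quote swap on a single tokenized sentence.
    return (sentence.replace('"?', '?"')
                    .replace('"!', '!"')
                    .replace('".', '."'))

print(restore_quotes('is THAT what you mean, Mrs. Hussey"?'))
# prints: is THAT what you mean, Mrs. Hussey?"
```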

